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Abstract 

Background: The genomic information which is transcribed into the primary RNA can be altered by RNA editing at 
the transcriptional or post-transcriptional level, which provides an effective way to create transcript diversity in an 
organism. Altering can occur through substitutional RNA editing or via the insertion or deletion of nucleotides 
relative to the original template. Taking advantage of recent high throughput sequencing technology combined 
with bioinformatics tools, several groups have recently studied the genome-wide substitutional RNA editing profiles 
in human. However, while insertional/deletional (indel) RNA editing is well known in several lower species, only 
very scarce evidence supports the existence of insertional editing events in higher organisms such as human, and 
no previous work has specifically focused on indel differences between RNA and their matching DNA in human. 
Here, we provide the first study to examine the possibility of genome-wide indel RNA-DNA differences in one 
human individual, NA12878, whose RNA and matching genome have been deeply sequenced. 

Results: We apply different computational tools that are capable of identifying indel differences between RNA 
reads and the matching reference genome and we initially find hundreds of such indel candidates. However, with 
careful further analysis and filtering, we conclude that all candidates are false-positives created by splice junctions, 
paralog sequences, diploid alleles, and known genomic indel variations. 

Conclusions: Overall, our study suggests that indel RNA editing events are unlikely to exist broadly in the human 
transcriptome and emphasizes the necessity of a robust computational filter pipeline to obtain high confidence 
RNA-DNA difference results when analyzing high throughput sequencing data as suggested in the recent 
genome-wide RNA editing studies. 

Keywords: Indel RNA-DNA differences, RNA-seq data analysis, Computational filtering 



Background protein diversity with limited primary RNA transcripts 

RNA is an important biomolecule that is deeply involved in an organism [2-4]. Alteration can occur through the 

in almost all aspects of molecular biology, such as pro- insertion or deletion of nucleotides relative to the ori- 

tein production, gene regulation, and viral replication ginal template (insertional/deletional or "indel" RNA 

[1]. In order to perform such a variety of functions, the editing), or via substitutional RNA editing, in which one 

primary RNA transcripts need to be extensively pro- nucleotide is replaced by or changed to another, 

cessed. By changing the genomically encoded sequence The most common type of known RNA editing in 

at the transcriptional or post-transcriptional level, RNA metazoans involves conversion of adenosine to inosine 

editing provides an effective way to create transcript and (A-to-I editing) [5], which is mediated by adenosine dea- 
minase acting on RNA (ADAR) enzymes [6-8]. Inosine 
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ting in mRNA can alter the genetic information stored 
in the primary sequence, leading to changes in protein- 
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large number of A-to-I editing events have been identified 
in the human transcriptome by genome-wide bioinforma- 
tics and high throughput sequencing studies [9-14]. While 
both coding and non-coding sequences undergo A-to-I 
editing, it has been found that editing occurs mainly in re- 
petitive sequences which are located within 5' or 3' un- 
translated regions (UTRs) or introns [9-14]. 

Taking advantage of whole-genome and transcriptome 
deep-sequencing technologies, recent studies have ex- 
tensively investigated all the potential types of substitu- 
tional RNA editing in the human transcriptome using 
bioinformatics tools that are capable of identifying mis- 
matches between RNA reads and the matching reference 
genome [15-17]. While the validity of some of the results 
in [15] is currently under debate [18-21], these studies 
revealed a large number of substitutional RNA editing 
candidate sites including many A-to-I editing events. It 
is now obvious that combining high throughput sequen- 
cing and bioinformatics has the ability to identify RNA 
editing events that occur at a single nucleotide level 
across the whole transcriptome. 

Insertional and deletional editing events have been dis- 
covered in various species [2,4], such as U insertions and 
deletions in kinetoplastids [22], G and A insertions in 
paramyxoviruses [23], and various types in Myxomycota 
[24]. Most of these events are found in mRNA 
sequences, with certain functions like creating new start 
and stop codons by uridine insertions in kinetoplastids 
[25], creating new open reading frames by nucleotide 
insertions in kinetoplastid [26] and Physarum [27] mito- 
chondria, and frameshifting between alternative ORFs in 
paramyxoviruses [23]. In higher organisms, no indel 
editing events have been identified until recently 
Zougman et al [28] reported two insertional RNA edit- 
ing events in human: according to their data in the 
5'UTRs of the linker histone HI mRNA and of the high- 
mobility group (HMG) mRNA, a single uridine each 
inserts between an A and a G, creating new translation 
start sites and producing N-terminally extended proteins. 
However, to our knowledge no follow-up work has been 
done concerning these editing sites and no additional 
indel editing events have been reported in human since. 

In this study, we explore the possibility of indel RNA 
editing events across the transcriptome in human by sys- 
tematically examining the variations between RNA-seq 
reads and their matching genome. Specifically, we exam- 
ine the possibility of genome-wide indel RNA-DNA dif- 
ferences in one human individual, NA12878. We apply 
different computational tools that use gapped alignments 
to identify indel differences between RNA reads and the 
matching genome. While hundreds of such indel candi- 
dates are revealed after initial selection, further analysis 
and filtering indicate that all of them are false positives 
which result from incorrect alignments including splice 



junctions, paralog sequences, diploid alleles, and known 
genomic indel variations. The results from our study 
suggest that indel RNA editing events are unlikely to 
exist widely in the human transcriptome and emphasize 
the importance of thorough filtering in genome-wide 
studies of RNA editing. 

Results 

Workflow 

Figure 1 shows the general workflow which we follow in 
our analysis. In short, we first align RNA-Seq reads 
against their matching genomic sequence. Once suitable 
RNA-Seq alignments are generated, RNA sequences that 
are different from the corresponding DNA sequence are 
identified as initial candidates. These initial candidates 
are subjected to multiple filters with stringent thresholds 
to eliminate false positives. The results as well as the 
considerations taken in developing the approach are 
described in more detail below. 

Mapping of RNA-seq reads yields hundreds of candidate 
indel RNA-DNA differences 

The initial step in the detection of RNA-DNA differ- 
ences is the accurate mapping of RNA-seq reads to their 
matching reference genome. Failure to correctly assign a 
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Figure 1 General Workflow to identify reliable indel RNA-DNA 
differences in the human transcriptome. 
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read to its original location may lead to spurious align- 
ments that may be misinterpreted as editing events. 
Since a large number of genomic variations including 
single nucleotide differences and indels exist among dif- 
ferent individuals, it is crucial to compare RNA and 
DNA sequences from the same background (to verify 
the importance of using the same background we in fact 
also performed our analysis using the hgl9 reference 
genome and found as expected a large number of ge- 
nomic variations reported as false positive results). Based 
on the reference genome (NCBI build 36) and incorpo- 
rating genomic variations and structural variations iden- 
tified by the 1000 Genome pilot project [29], the 
Gerstein lab has recently created a version of the diploid 
genome sequence for the NA12878 individual from the 
lymphoblastoid cell line GM12878 [30]. Matching deeply 
sequenced RNA-Seq data sets for the same cell line are 
also available. We use this assembled genome to identify 
possible insertional and deletional RNA-DNA differ- 
ences in NA12878, by directly aligning RNA-seq reads 
against their matching genome. Since the assembled 
genome is a diploid one which contains maternal and 
paternal haplotypes with small variations in sequences, 
we first map RNA-seq reads to the maternal genome 
and list all the potential candidates and then remove the 
candidates resulting from maternal-paternal genome va- 
riations (see below). The detailed information of RNA-seq 
data and diploid genome for GM12878 we used in this 
study are described in the Methods section. 

The rapid emergence of high-throughput sequencing 
techniques has resulted in the development of a variety 
of short sequence read mappers that are based on differ- 
ent alignment strategies. Since our goal is to identify 
indels within all RNA-DNA differences events, the basic 
requirement for the mapping tools is that indels should 



be allowed when aligning short RNA reads to the refer- 
ence. By evaluating most of the currently available map- 
ping tools, we find that BFAST (Blat-like Fast Accurate 
Search Tool) [31] is one of the most suitable softwares 
for our indel analysis. In contrast to some other algo- 
rithms that speed up the mapping process by ignoring 
errors and indels, BFAST is very sensitive to errors, 
single-nucleotide polymorphisms (SNPs) and especially 
indels with a considerably fast mapping speed [31]. Since 
mapping bias inherent to the mapping algorithm may 
affect results, we also use another tool, bowtie2 [32], a 
fast and accurate mapping algorithm in which gapped 
alignment is allowed and compare the results. 

For the initial BFAST mapping (alignment settings are 
described in detail in the Methods section), out of the 
113,902,864 reads, 79,833,200 could be mapped to the 
assembled maternal haploid genome of GM12878 over 
their entire length. For bowtie2 (using the default setting 
suggested in the manual) a total of 40,862,987 reads 
could be mapped to the assembled maternal genome 
over their entire length. This mapping ratio is signifi- 
cantly lower than that in BFAST, which is probably due 
to the higher stringency of bowtie2 for mapping a read 
to the reference genome. 

The mapping output for BFAST and bowtie2 are SAM 
(Sequence Alignment/Map) files, which were processed 
by the SAMtools software package [33], a package that 
was originally designed to identify genomic variations. We 
conduct initial indel variant calling taking advantage of 
the mpileup algorithm implemented in SAMtools, using 
the default settings used for calling genomic SNPs except 
that we do not require "heterozygotes" to reach 50% read 
support since editing could occur at lower frequency. In 
order to minimize the influence of sequencing and reverse 
transcription errors, candidates are required to pass 



A Maternal genome, chromosome 1, 58848163-58848227 



Maternal 

Paternal 

Readl 

Read2 

Read3 



CTCCTCGTGGTAGGAACGGCC ACTCTTGGGGTG \ - 1 CAGCCTTCCTGTGATTCTTCGGATCAGCAG 
CTCCTCGTGGTAGGAACGGCCACTCTTGGGGTG \M CAGCCTTCCTGTGATTCTTCGGATCAGCAG 

CGCCCACTCTTGGGGTG \A1 CAGCCTTCCTGTG 

C ACTCTTGGGGTG HA1 CAGCCTTCCTGTGATTA 

CTTGGGGTG HA1 CAGCCTTCCTGTGATTCTTCG 

********** * ************* 



B Maternal genome, chromosome 1, 222528110-222528174 



Maternal 

Paternal 

Readl 

Read2 

Read3 

Read4 

Read5 



TTAAGGCCAAATGTGTTTTTCTTTCTTCTCCCC IT - C 
TTAAGGCCAAATGTGTTTTTCTTTCTTCTCCCCTTTC 

CAAATGTGTTTTTCTTTCTTCTCCCC FT - C 

GTGTTTTTCTTTCTTCTCCCC 1TTC 

GTTTTTCTTTCTTCTCCCC FT - C 

TTTTTCTTTCTTCTCCCC FT - C 

TTTTCTTTCTTCTCCCC 1TTC 



*****************£* * £ * * * 



iCTGACTTTTCCCTTTCTCCCATCCCCGA 
:CTGACTTTTCCCTTTCTCCCATCCCCGA 

;CTG 

iCTGACTT 

iCTGACTTTTC — 

iCTGACTTTTCC - 

iCTGACTTTTCC 



Figure 2 Examples of false positives resulting from diploid alleles. (A.) All the RNA reads support the paternal genome version; (B.) Some of 
the RNA reads support the maternal genome version while others support the paternal genome version. Note: only part of the RNA reads 
mapping to the position are shown here. 
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quality control thresholds for base calling quality, read 
mapping quality (reliability of the alignment across the 
genome), read coverage, variant/reference quality, and 
indel type and size (see details in the Methods section). 

After these initial selections, 685 candidates remain in 
the BFAST results while 250 candidates remain in the 
bowtie2 results. Of these, 110 were shared between the 
two mapping approaches. The fact that bowtie2 has 
much fewer candidates than BFAST is probably due to 
the lower read mapping ratio mentioned above. As for 
the candidates found by bowtie2 but not by BFAST, 
many are at the edge of the filtering thresholds. Thus, 
small differences in the way quality measures and va- 
riants of aligned reads are reported in the two mappers 
lead to a candidate being just above the threshold in one 
method and just below in the other. We notice, that as 
indicated below all these questionable reads are filtered 
out by the additional false positive filtering steps and the 
overlap between remaining candidates based on BFAST 
alignments and remaining candidates based on bowtie2 
alignments is much larger. 

Careful filtering reveals that all indel editing candidates 
are false positives 

After the initial selections, the list of indel variations 
called by SAMtools may still contain a large number of 



false positives that are unrelated to indel RNA editing. 
These false positives may be a result of known genomic 
variations, different alleles from diploid genomes, and 
misalignment of reads due to, e.g., splice junctions and 
paralog sequences in the genome. We thus apply a series 
of stringent filters to remove false positives from our 
candidate lists. 

Since we initially aligned RNA-seq reads to the mater- 
nal genome, we first check if the candidate indel RNA- 
DNA differences result from genuine maternal-paternal 
genome differences by realigning all the reads that can 
be mapped to the putative indel sites (including reads 
that support editing and reads that support the maternal 
genome version) to the paternal genome using BFAST. 
This filter identifies 45 sites in the BFAST dataset as well 
as 60 in the bowtie2 dataset that reflect genuine 
maternal-paternal genome variations where the apparent 
indels in the RNA-seq reads reflect the paternal RNA 
transcript form (see Figure 2 for two examples). 

Next, we address possible misalignment due to splice 
junctions. We use BLAST [34] to search the genomic 
surrounding sequence (without the indel) from 32bp up- 
stream to 32bp downstream of each remaining indel 
candidate against the reference genome (NCBI build 37). 
In addition, we search these surrounding sequences 
(with and without indels, respectively) against all the 



Maternal genome, chromosome 1, 45789593-45789657 



B 



Maternal 

Paternal 

Reference 

Readl 

Read2 

Read3 

Exonseq 



AACATAGCTGAAACCTCTGCTTCTC TCACCTGG-CAG GTAAAAGCAGCTGTTAAGTATGCCCTTAG 
AACATAGCTGAAACCTCTGCTTCTC TCACCTGG - C AG GTAAAAGCAGCTGTTAAGTATGCCCTTAG 
AACATAGCTGAAACCTCTGCTTCT(jTCACCTGG-qAqGTAAAAGCAGCTGTTAAGTATGCCCTTAG 

- - -CTGGTCAGGTAAAAGCAGCTGTTAAGTATGCCC 

- G CCTGGTCAGGTAAAAGCAGCTGTTAAGTATGC 

- GAG CCTGGTCAGGTAAAAGCAGCTGTTAAGTAT 

GATTGGTCTGGGTACCTGGAAGAG1 GAGCCTGGTCAGGTAAAAGCAGCTGTTAAGTATGCCCTTAG 



>n ref | NM 153326.2] L ' 4*1 H I Homo sapiens aldo-keto reductase family 1, member Al (aldehyde 



_ hm.'.i i 

reductase) (AKR1A1), transcript variant 2, mRNA 
Length=1469 

GENE ID: 1Q327 AKR1A1 | aldo-keto reductase family 1, member Al (aldehyde 
reductase) [Homo sapiens] (Over 16 PubMed links) 



Score = 63.9 bits (34), 
Identities = 37/38 (97%) 
Strand=Plus/Plus 



Query 
Sbjct 



Expect = 3e-10 
Gaps = 1/38 (3%) 



29 CCTGG - CAGGTAAAAGCAGCTGTTAAGTATGCCCTTAG 

455 WMAAIMiMAMM™ 



65 

492 



C RefSeq_RNA 
Readl 
Read2 
Read3 



TGAGCCTGGTCAGGTAAAAGCAGCTGTTAAGTATGCCC 

CTGGTCAGGTAAAAGCAGCTGTTAAGTATGCCC 

- - -GCCTGGTCAGGTAAAAGCAGCTGTTAAGTATGC - - 
-GAGCCTGGTCAGGTAAAAGCAGCTGTTAAGTAT 



Figure 3 An example of a false positive resulting from a splice junction. (A.) Multiple alignment of RNA-seq reads and genomic DNA 
(maternal, paternal genome of NA12878 and NCBI build 37 reference genome) Note: only part of the RNA reads mapping to the position are 
shown here. The blue box shows the mismatches between RNA-seq reads and the reference genome, but these "mismatches" can be correctly 
mapped to the exonic sequence. (B.) BLAST results of the alignment of the genomic version of the sequence to the refSeq RNA (only part of the 
sequence can be properly aligned). A further investigation of the mRNA annotations show that the AG in the genomic sequence (marked by a 
red rectangle in (A)) corresponds to a 3' splice junction. The AG in the reads at that position in the alignment is not the actual 3' splice signal but 
an AG in the spliced mRNA that happens to follow the intron and thus can be mapped to the genomic AG splice signal. (C.) The reads which 
support the "indel" can be fully aligned to the same refSeq RNA sequence as in (B). 
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currently known splice junctions including the refSeq 
RNA database at NCBI, Gencode, Ensembl and UCSC 
genes [17]. We find that a substantial number of initial 
candidates is due to incorrect mappings of reads across 
splice junctions, based on the following characteristics: 
(1) The genome version (surrounding sequence without 
indels) aligns to the expected position in the reference 
genome without indels; (2) Only part of the sequence 
from the genome version can be properly aligned with- 
out indels to one of the RNA sequences in the splice 
junction database (this RNA sequence corresponds to 
the same position in the reference genome as found in 
(1), but contains a splice junction close to the putative 
indel site); (3) The surrounding sequence which includes 
the indel can be fully aligned to the same splice junction 
RNA sequence identified in (2) without any gaps (see ex- 
ample shown in Figure 3). Further examinations of these 



alignments reveal that most of these putative indels are 
accompanied by one or more mismatches between ge- 
nomic DNA and RNA-seq reads. Altogether, 609 of the 
candidate indels in the BFAST dataset are identified as 
false positives by this filter while in the bowtie2 dataset 
169 result from misaligned splice junctions. 

As suggested in [18], potential paralog sequences may 
lead to spurious RNA-DNA difference events and thus 
also need to be ruled out. Therefore, we investigate if 
some of the candidate indel RNA-DNA differences are 
signatures of paralog sequences in the genome, by rea- 
ligning the indel sites and surrounding sequences back 
to the entire reference genome (NCBI build 37) using 
BLAST [34]. If an alignment without indels can be 
found, we declare the candidate as resulting from a mis- 
aligned paralog (see example in Figure 4). The reason 
that we can obtain mappings to different locations when 



Maternal genome, chromosome 14, 81438746-81438810 



A 



Maternal AACATCAACCGACAGTTGGAGGTATACACAAGC G - <\GGTGACCCTGAGAGTGTGGCTGGGGAGATG 



Paternal AACATCAACCGACAGTTGGAGGTATACACAAGC 

Readl TGGAGGTATACACAAGC 

Read2 AGGTATACACAAGC 

Read3 ACACAAGC 



g - !\ggtgaccctgagagtgtggctggggagatg 

gg^ggtgaccctgaga 

gg \ggtgaccctgagagtg 

ggJ\ggtgaccctgagagtgtggctg 



B > ref I NT 026437 . 12 1 13 Homo sapiens chromosome 14 genomic contig, GRCh37.p5 Primary 
Assembly 
Length=88289540 

Features flanking this part of subject sequence: 

383189 bp at 5 side: protein sel-1 homolog 1 precursor 

3704518 bp at 3' side: leucine- rich repeat transmembrane protein FLRT2 precursor 

Score = 121 bits (65), Expect = 8e-26 
Identities = 65/65 (100%), Gaps = 0/65 (0%) 
Strand=Plus/Plus 

Query 1 AACATCAACCGACAGTTGGAGGTATACACAAGCGAGGTGACCCTGAGAGTGTGGCTGGGG 60 

Sbjct 63383277 63383336 
Query 61 AGATG 65 

Sbjct 63383337 63383341 

G >c ref J NT 011520. 12 1 13 Homo sapiens chromosome 22 genomic contig, GRCh37.p5 Primary 
Assembly 
Length=29755346 



Sort alignments for this subject sequence by: 
E value Score Percent identity 
Query start position Subject start position 

Features in this part of subject sequence: 
eukaryotic translation initiation factor 3 subunit L isof... 
eukaryotic translation initiation factor 3 subunit L isof... 



Score - 69.4 bits (37), Expect = 3e-10 
Identities = 40/41 (98%), Gaps = 1/41 (2%) 
Strand=Plus/Plus 

Query 1 AACATCAACCGACAGTTGGAGGTATACACAAGcETVgGTGA 40 

Sbjct 17656887 17 

Features in this part of subject sequence: 
eukaryotic translation initiation factor 3 subunit L isof. 
eukaryotic translation initiation factor 3 subunit L isof. 

Score = 52.8 bits (28), Expect = 3e-05 
Identities = 31/32 (97%), Gaps = 1/32 (3%) 
Strand=Plus/Plus 

Query 35 AGGTGACCCTGAGAGTGTGGCTGGGGAG - ATG 65 

Sbjct 17660944 17660975 



Figure 4 An example of false positive resulting from a paralog sequence. (A.) Multiple alignment of RNA-seq reads and genomic DNA 
(maternal and paternal genome) Note: only part of the RNA reads mapping to the position are shown here. (B.),(C) BLAST results of the 
alignment of the genomic version of the sequence to the reference genome (NCBI build 37). It can be aligned to the reference genome at more 
than one position. The gap highlighted in (C) by the red rectangle is the same as the indel difference shown in (A). 
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aligning surrounding sequence to the reference genome 
may be due to the inherent bias in the mapping tools or 
gaps in the assembly of the NA 12878 genome compared 
to the reference genome. This filter eliminates 19 candi- 
dates among those identified by BFAST and 10 candi- 
dates among those identified by bowtie2. 

After eliminating false positives from incorrect align- 
ment in the same individual described above, only 
12 candidates remain in the BFAST dataset (see Figure 5) 
and 11 in the bowtie2 dataset within 8 sites common to 
both. However, these candidates are listed as genomic 
indel variations in the SNP database. We thus conclude 
that these most likely represent misassemblies in the 
maternal and/or paternal genome rather than true indel 
RNA-DNA differences. Based on this filtering analysis, 
we conclude that none of the candidates from the initial 
lists can pass all of the filters. 

Sensitivity of computational pipeline 

In order to conclude that our results indeed imply that 
indel editing is rare rather than being a result of the in- 
ability of our pipeline to find indel differences that are 
present, we tested the sensitivity of our computational 
pipeline, by examining how many of the known genomic 
indels in the NA12878 diploid genome can be found by 
our pipeline before furthering filtering. We first align the 
NA12878 maternal genome sequences against the pater- 
nal genome sequences to locate positions of all the short 
indels. For all of these sites, we ranked them according 
to the reads coverage on the site. We found that for the 



top 100 expressed genomic indel sites (which have read 
coverages down to 5 reads) 44 sites are found when 
using the maternal genome for indel calling (where we 
excluded sites with homopolymer runs of greater than 
5 bp which have a higher chance to result from sequen- 
cing errors rather than true indel differences). If indel 
calls from alignments to the maternal and the paternal 
genome are combined, we found nearly 90% of the co- 
vered genomic indels. We thus conclude that the lack of 
indel editing sites found in our study is not due to a lack 
in sensitivity of the pipeline. 

Additional RNA-seq datasets yield consistent results 

In order to study the dependence of these results on 
technical details such as read length, we examine other 
RNA-seq data sets from a recent study [30] on the same 
cell line GM 12878 (for a detailed data description see 
the Methods section) in addition to the RNA-seq data 
described above. We perform the same analysis as 
described above for the 54bp single-end short reads in 
this data set but limit ourselves to BFAST only since this 
produced more candidates on the first data set. The 
results are summarized in Table 1. We find that 1037 
sites pass the initial selection (197 of which overlap with 
the candidate lists from the first data set). Applying the 
different filters again identifies all of them as belonging to 
one of the four categories: genuine paternal allele, mis- 
mapped splice junction, mismapped paralog sequence, 
and genomic variation annotated in the SNP database. 



A Maternal genome, chromosome 1, 143729398-143729462 



Maternal 

Paternal 

Reference 

Readl 

Read2 

Read3 

Read4 

Read5 



CAAGCCATATATATTAGGGAATAGTAGATTGTTA 
CAAGCCATATATATTAGGGAATAGTAGATTGTTA 
CAAGCCATATATATTAGGGAATAGTAGATTGTTA 



- TAGGGAATAGTAGATTGTT AAT 1TCG 



-TfTCGTTTTTTCCCTCCCAGTGCATTTTAAA 
- T FTC GTTTTTTC C CTC C C AGTG C AT TTTA AA 
- T FTC GTTTTTTC C CTC C C AGTG C AT TTTA AA 

- TTAGGG AATAGTAGATTGTTlAATFTCGTTTTTT 

GTTTTTTC 

-GATTGTTlAATtTTCGTTTTTTCCCTCCCAGTGCA 

- GTT AAT rTCGTTTTTTCCCTCCCAGTGCAT ATT - - - 
- GT7|AA7|TTC GTTTTTTC C CTC C C AGTG CAT TTT - - - 



* * ********* 



B Maternal genome, chromosome 1, 295892-295956 



Maternal 

Paternal 

Reference 

Readl 

Read2 

Read3 

Read4 

Read5 

Read6 

Read7 

Read8 

Read9 

ReadlO 

Readll 



TTAGGCTGGTGCTGCCAAAAAGAAAAGCAACA rA-G \GTTTAAGTATCCAGTAGTGATTTGTAAAC 
TTAGGCTGGTGCTGCCAAAAAGAAAAGCAACA TA-G \GTTTAAGTATCCAGTAGTGATTTGTAAAC 
TTAGGCTGGTGCTGCC AAAAAGAAAAGCAAC A fA-G^GTTTAAGTATCCAGTAGTGATTTGTAAAC 



- CTGGTGCTGACAAAAAGAAAAGCAACA TAAG \G - - 

- - TGGTGCTGCC AAAAAGAAAAGCAAC A TAAG <\GT - 
- -TGGTGCTGCCAAAAAGAAAAGCAACA TA-GAGTT 

- - -GGTGCTGCCAAAAAGAAAAGCAACA TAAG !\GTT 



- GTGCTGCC AAAAAGAAAAGCAAC A TA - G f\GTTTA - 
- - TGCTGCCAAAAAGAAAAGCAACA TAAG AGTTTA - 
- -TGCTGAC AAAAAGAAAAGCAAC A fA-G AGTTTAA 



GAAAAGCAAC A TAAG AGTTTAAGTATCCAGTAG 

GAAAAGCAAC A FA-G AGTTTAAGTATCCAGTAGT 

GCAACA TAAG AGTTTAAGTATC C AGTAGTG ATT - 



GCAACA fA-G AGTTTAAGTATC C AGTAGTG ATTT - 



Figure 5 Alignment of two of the 12 remaining candidates. These two candidates pass all the other filters but are found in the SNP database 
(rs56026824 in (A) and rs72551 074 in (B)). The figure shows the multiple alignments of RNA-seq reads and genomic DNA (maternal, paternal 
genome of NA12878 and NCBI build 37 reference genome) Note: only part of the RNA reads mapping to the position are shown here. 
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Table 1 Summary for indel candidates analysis of 



additional RNA-seq datasets 



Initial candidates 


1037 


Paternal alleles 


95 


Splice junctions 


907 


Paralog sequences 


10 


Known genomic variations 


25 



Discussion 

In this work, we provide the first systematic study of the 
possibility of genome-wide indel RNA-DNA differences 
in one human individual, NA12878, whose RNA and 
matching genome have been deeply sequenced. We ap- 
plied different computational tools that are capable of 
identifying indel differences between RNA reads and the 
matching reference genome. After initial selection using 
SAMtools, we found hundreds of such indel candidates. 
However, with careful further analysis and filtering, we 
found that all of them are false-positive results such as 
splice junctions, paralog sequences, different alleles from 
diploid genomes, and known genomic indel variations 
from the SNP database. We thus conclude that there is 
no evidence for widespread insertional or deletional 
RNA editing in the human genome. 

However, it should be noticed that the RNA-seq data 
sets we used are from a particular lymphoblastoid cell 
line; it is thus in principle still possible that widespread 
indel RNA editing events could be cell type specific and 
that we may have missed them by selectively focusing on 
the lymphoblastoid cell line. Moreover, our stringent re- 
quirement for detecting such events (at least 2 RNA-seq 
reads with high base quality and mapping quality sup- 
porting editing) may have missed potential sites which 
are edited at very low frequency. 

It is interesting to relate our findings to the recent dis- 
cussions on substitutional RNA editing initiated by Li 
et al. [15]. Several technical comments on that study 
[19-21] pointed out that the mismatches of RNA-seq 
reads to the reference genome are almost exclusively at 
the ends of sequencing reads. The response by Li et al. 
[35] proposes that one of the reasons resulting in this 
bias is co-occurrence of substitutional RNA-DNA differ- 
ence sites with insertion/deletion RNA-DNA differences 
sites. Our results here indicate that such widespread 
indel RNA-DNA differences are unlikely to exist. Rather, 
our finding of false positives resulting from splice junc- 
tions that often combine apparent mismatches and 
indels seems to provide a possible explanation for the 
coexistence of mismatches and indels as well as their oc- 
currence at the end of the reads. Thus, our observation 
further questions the proposal of indel RNA-DNA mis- 
matches in [35] to explain the end effect of mismatches. 



The absence of indel RNA editing in our study also 
has to be discussed in the light of the previous study 
suggesting two potential insertional RNA editing sites in 
human [28]. This apparent discrepancy led us to speci- 
fically revisit the two insertional RNA editing sites iden- 
tified in [28]. Their work suggested that, a single uridine 
each inserts between A and G in the 5'UTRs of linker 
histone HI and high-mobility group (HMG) mRNA and 
creates new translation start sites and produces N- 
terminally extended proteins. Further examination of 
their study and our analysis allow us to propose several 
possible reasons for this discrepancy. 

First, as mentioned above, the editing events may not 
occur in the specific cell line we investigated. Moreover, 
the study in [28] showed that in certain cell types the 
abundance of the "edited" form of proteins is much 
lower than the normal form of proteins; thus, it is pos- 
sible that the coverage of RNA-seq data we used is not 
enough to detect the editing events which occur at a low 
frequency based on our filtering criteria. In fact, our 
alignment and filtering data show that only one 
RNA-seq read can be reliably aligned to the "AG" pos- 
ition in the 5' UTR of H 1.0 mRNA without insertion for 
both, BFAST and bowtie2, results. We note, that this is 
not due to the lack of a polyA tail on the histone H1.0 
mRNA when preparing the sequencing libraries, since 
the synthesis of histone H1.0 is not cell cycle-regulated 
and its mRNA is polyadenylated [36,37]. For the other 
case, HMGN1, around 10 reads can be reliably mapped 
and none of them contain the insertion site. This indi- 
cates that the read coverage at these two sites may be 
not sufficient to identify the "edited" version of the RNA 
(according to [28], for hl.0, 11 of 301 EST sequences 
support the "edited" version; while for hmgnl, only one 
EST sequence supports the "edited" version). 

However, the results of our filtering analysis promote us 
to also consider the possibility that these apparent "inser- 
tional editing sites" could be signatures of other biological 
"artifacts", such as a paralog sequence or splice junction. 
We thus examined if these sites could result from paralog 
sequences in the genome but found no such paralog se- 
quence. However, when examining the EST data which 
were used as evidence for editing sites ([28] found 301 
ESTs carrying the H1.0 5'UTR, 11 of which contain the U 
insertion, and only one EST sequence supporting a U in- 
sertion in HMGN1), one unusual property is observed: All 
of the 11 sequences for H1.0 have a very short 5' sur- 
rounding sequence (exclusively 3~4bp) at the "editing site" 
(see Figure 6). For two of the 11 sequences which have 
4 bp upstream of the editing site, the first base is "G" which 
does not match to genomic version "C". For HMGN1, the 
only supported sequence contains several mismatches at 
the upstream surrounding sequence between the genomic 
sequence and the EST sequence. 
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ET-H1.0 
H1.0 

ET-HMGN1 
HMGN1 



B 

BX397992 
BX432450 
BX418456 
BX371207 
BX419072 
BX386531 
BX432449 
BX426875 
BX328722 
BX410653 
BX410654 



3' splice junction ? 



imci 



EST 




GGAHE CTGGGAAAAGGGAGGCAG... BX328722 
. . JGCGdAd CTGGGAAAAGGGAGGCAG. . . BX494651 
GGATGCTCGGGCGGCGGGAGGAG. . . CV807291 
GC^CTCGGGCGGCGGGAGGAG... BQ887780 

3' splice junction ? 

GGlATGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 
GG kTGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 
GG kTGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 
GG kTGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 

GGG^TGCTGGGAAAMGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 60 
GG ATGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 
GG MGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 

GGG Z^TGCTGGGAAAARGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 60 
GG A.TGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 
GG &TGCTGGGAAAAGGGANGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 
GG MGCTGGGAAAAGGGAGGCAGAGGAGGCGGAGGCAGAGGCAGAGGCAGAGCCCGGTG 5 9 



Figure 6 Proposed explanation for previously identified insertional RNA editing sites in human. Based on our indel analysis of false 
positives and EST data in [28], we hypothesize that the two "insertional RNA editing sites" may be signatures of novel splice junctions. For h 1.0, the 
base "G" in the EST sequence is a mismatch compared to the genomic base "C" which is present in all EST sequences which have 4 bp upstream of 
the "editing site". For HMGN1, the only one EST sequence which supports "editing" also displays a mismatch ("G" in the genome and "C" in the EST 
sequence) upstream of the "editing site". The "AG" marker in the red box in the H1.0 and HMGN1 sequence may serve as an unknown 3' splice site. 



This is very similar to the pattern observed in our 
"splice junction" false positives in which "indels" occur 
close to the end of the alignment and coexist with mis- 
matches. This observation thus may imply that it may have 
resulted from rare and so far unknown splicing events. 
Again, we note that histone 1.0 belongs to replication- 
independent histone mRNAs [36,37] and thus could in 
principle be spliced even though no such splice variant has 
been documented so far. Moreover, the original study [28] 
indicated that the "edited" form of H 1.0 protein colocalizes 
with splicing speckles which may suggest a connection to 
splicing. Since [28] did not directly sequence the DNA and 
corresponding RNA surrounding the "editing sites", care- 
ful examination revealed that splicing can also explain all 
the additional experimental observations in their study, 
i.e., an extended protein form, restriction enzyme diges- 
tion, etc. Therefore, it is possible that these only two 
"insertional RNA editing sites" so far are indeed results 
of novel splicing events, which would require experi- 
mental verification. 



Table 2 First round RNA-seq data used in this study 



SRA accession 


Run used in this study 


number 




SRX000565 (33bp) 


SRR002055, SRR002063, SRR005091, SRR005096 


SRX000566 (33bp) 


SRR002052, SRR002054, SRR002060 


Second round RNA-seq Data used in this study 



SRX082145 SRR306998, SRR306999, SRR307000, SRR307001, 

SRR307002, SRR307003, SRR307004 



Conclusion 

In this study, we systematically examined the possibility 
of genome-wide indel RNA-DNA differences in one 
human individual, NA12878, by aligning several RNA- 
seq datasets to the corresponding assembled diploid 
genome from the same cell line. The initial selection 
revealed a number of indel candidates; however, following 
analysis showed that all of them are unrelated to RNA 
editing. Overall, our study suggests that the previously 
proposed insertional RNA editing events are unlikely to 
exist in the human transcriptome and that to obtain high 
confidence RNA-DNA difference results, it is necessary to 
build a robust computational filter pipeline when analy- 
zing high throughput sequencing data. 



Table 3 Index sets in BFAST used in this study 



Index number Mask 



1 111111111111111111 


2 


111010001110001110100011011111 


3 


11110100110111101010101111 


4 


11111111111111001111 


5 


1111011101100101001111111 


6 11110111000101010000010101110111 


7 


1011001101011110100110010010111 


8 1110110010100001000101100111001111 


9 


1111011111111111111 


10 


11011111100010110111101101 
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Methods 

Reference genome and RNA-seq reads (Data sources) 

The assembled diploid genome sequences of the indivi- 
dual NA12878, genome annotations and corresponding 
variants information (between the maternal and paternal 
sequences and the reference genome NCBI36/hgl8) were 
downloaded from http://sv.gersteinlab.org/NA12878_di- 
ploid/. Illumina generated RNA-seq data from the same 
cell line (lymphoblastoid cell line GM12878) as the one 
used for assembly of the diploid genome sequences of the 
individual NA12878 were downloaded from the NCBI Se- 
quence Read Archive (SRA). In order to verify the robust- 
ness of our results with respect to sequencing parameters 
such as read length, other single end RNA-seq datasets 
were downloaded from NCBI GEO (GEO accession num- 
ber GSE30401). The latter RNA-Seq data sets were gene- 
rated as part of the ENCODE Project [38]. Table 2 
provides the exact identifiers of the data sets. 

Mapping RNA-seq reads to the corresponding 
reference genome 

RNA-seq reads were first aligned to the maternal- 
derived haploid genome sequences using the standard 
pipeline of the Blat-like Fast Accurate Search Tool 
(BFAST) [31]. As one of the computational tools that is 
capable of discovering indels with gapped local align- 
ment, BFAST is very sensitive to errors, SNPs and espe- 
cially indels with a considerable mapping speed [31]. 
Specifically, ten indexes recommended for aligning reads 
with length less than 40bp in the BFAST manual (see 
Table 3) were used to index the reference maternal hap- 
loid genome. 

Most of the parameters in the alignment process were 
set to their default values. A single lookup is ignored if it 
returns more than K=8 candidate alignment locations 
(CALs); the maximum number of CALs for a read was 
M=1280. Local alignments were performed for each 
CAL using default settings and nucleotide substitutions, 
insertions and deletions were identified in the gapped 
alignment. Alignments were prioritized by alignment 
score and only the highest scoring alignment for each 
mapping read was output. The mapping output was set 
to SAM format. 

Post-processing of mapping output and variant calling 

To identify RNA editing sites, the output RNA-seq align- 
ment files in SAM format were processed by the publicly 
available, open source SAMTools software package [33] 
(http://samtools.sourceforge.net/) for variant calling. The 
version we used in this study was samtools.0.1.17. 

Using SAMTools, the output SAM files were first con- 
verted to their binary versions (BAM files) and then 
these BAM files were sorted and indexed for rapid 
lookup. The sorted BAM files were further processed in 



the variant calling step: using the "mpileup" function in 
SAMTools, indexed reference sequences and position 
sorted bam alignment files generated files with read in- 
formation at sites where mismatches and indels from 
the reference sequence were detected. Then, only infor- 
mation for indel differences was kept, while reads that 
contained only mismatches were discarded. The output 
file after this step served as the starting dataset for the 
indel RNA editing analysis. 

Initial filtering of indel variants 

To eliminate false positive results due to sequencing and 
reverse transcription errors, the initially identified indel 
variants were further filtered by the following criteria: 

1. Base quality filter: remove bases at the indel site with 
a sequencing quality score below 20. 

2. Mapping quality filter: remove reads with a mapping 
quality score below 20; discard a read if the indel 
position is within 2bp of the 5' end or 3' end; discard 
an indel-containing read if more than 3 mismatches 
are present. 

3. Coverage depth filter: remove candidates with less 
than 2 indel-containing nonduplicated reads; remove 
candidates with less than 5 reads; remove candidates 
with less than 5% indel-containing reads of the total 
covering reads. 

4. Variant quality: remove candidates with QUAL 
Phred-score of variant calling below 0.01. 

5. Indel type and size filter: remove variant sites that 
display more than one nonreference alleles as well as 
variant sites that contain any uncertain bases ("N"); 
only keep candidates with only one nucleotide 
difference from the genomic DNA (i.e., indel size 
should be one); remove variant sites that display 
homopolymer runs of more than 5 identical 
nucleotides. 
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